ABSTRACT

In this tutorial, we discuss several practical issues regarding specification and solution of dependability and 
performability models. We compare model types with and without rewards. Continuous-time Markov chains 
(CTMCs) are compared with (continuous-time) Markov reward models (MRMs) and generalized stochastic 
Petri nets (GSPNs) are compared with stochastic reward nets (SRNs). It is shown that reward-based models 
could lead to more concise model specification and solution of a variety of new measures. With respect to the 
solution of dependability and performability models, we identify three practical issues: largeness, stiffness,
and non-exponentiality, and we discuss a variety of approaches to deal with them, including some of the 
latest research efforts. 


1 This research was partially supported by the National Aeronautics and Space Administration under NASA Contract No. 
NAS1-19480 while the first two authors were in residence at the Institute for Computer Applications in Science and Engineering
(ICASE), NASA Langley Research Center, Hampton, VA 23681.

1 Introduction


Dependability, performance, and performability evaluation techniques provide a useful method for under- 
standing the dynamic behavior of a computer or communication system. To be useful, the evaluation 
should reflect important system characteristics such as fault-tolerance, automatic reconfiguration, and re- 
pair; contention for resources; concurrency and synchronization; deadlines imposed on the tasks; and graceful 
degradation. Furthermore, complexity of current-day systems and corresponding system evaluation should 
be explicitly addressed. 

Traditional performance evaluation is concerned with contention for system resources. Performance
evaluation of parallel and distributed systems also addresses concurrency and synchronization of tasks. Real-time
system performance evaluation takes into account various hard and soft deadlines on task execution
times.

Reliability, availability, safety, and related measures are collectively known as dependability. Dependability
evaluation encompasses fault-tolerance, reconfiguration, and repair aspects of system behavior. More
recently, interest in combining performance and dependability evaluation has grown. Such performability 
evaluation considers the graceful degradation of the system in addition to the dependability aspects. 

While measurement is an attractive option for assessing an existing system or a prototype, it is not a 
feasible option during the system design and implementation phases. Model-based evaluation has proven to 
be an attractive alternative in these cases. A model is an abstraction of a system that includes sufficient 
detail to facilitate an understanding of system behavior. Several types of models are currently used in 
practice. The most appropriate type of model depends upon the complexity of the system, the questions to 
be studied, the accuracy required, and the resources available for the study. 

Discrete-event simulation is the most commonly used modeling technique in practice but it tends to be 
relatively expensive. Analytical modeling provides a cost-effective alternative to simulation for studying the 
performance and dependability of computer and communication systems. Due to recent developments in 
model generation and solution techniques and automated tools, large and realistic models can be developed 
and studied. In this tutorial we concentrate on such analytic models. The rest of this tutorial is organized as 
follows. In the next section, we present an overview of various approaches to dependability and performance 
modeling. In Section 3, we show how performability analysis can be carried out using MRMs. We also 
show how dependability measures can be obtained via performability analysis using special reward rate 
assignment. 

In Section 4, we compare GSPNs and stochastic reward nets. In Section 5, we discuss in detail some practical
issues in solving dependability and performability models: largeness, stiffness, and non-exponentiality.

Figure 1: A reliability block diagram model.


2 Approaches to Modeling 

2.1 Dependability Modeling 

Reliability block diagrams, fault trees, and reliability graphs are commonly used to study the dependability 
of systems [59]. Although these models are concise and have efficient solution methods, they cannot represent 
dependencies among components [56] as easily as CTMC models can [21, 23]. 

We begin by considering a fault-tolerant, multi-processor computer with multiple, shared memory modules.
The system is able to detect a processor or memory module failure and reconfigure itself to continue
operation without the failed component. The system can operate with just one processor and one memory 
module. 

Our first model of this system is the reliability block diagram in Figure 1. We could attach to each
component the probability of having failed by a particular time. In a more general parameterization, a
failure time distribution function, rather than a probability value, can be attached to each component. For
example, one can assign the exponential distribution $F_p(t) = 1 - e^{-\lambda_p t}$ to processors and $F_m(t) = 1 - e^{-\lambda_m t}$
to memories. We can request the system failure time distribution as a function of the time variable $t$. For a
system with two processors and three memory modules,

$$F_{sys}(t) = 1 - \left(1 - (1 - e^{-\lambda_p t})^2\right)\left(1 - (1 - e^{-\lambda_m t})^3\right).$$


We can also ask for the mean time to system failure,

$$MTTF_{sys} = \int_0^\infty \left(1 - F_{sys}(t)\right) dt.$$
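
The block diagram formulas can be checked numerically. The following is a minimal sketch, not part of the original tutorial: the function names, the integration horizon, and the trapezoidal rule are illustrative assumptions. It evaluates the failure time distribution for two processors and three memories and approximates the mean time to failure as the integral of the system reliability.

```python
import math

def f_sys(t, lam_p, lam_m, n_p=2, n_m=3):
    """CDF of the system failure time for the series-of-parallel block diagram."""
    f_p = 1.0 - math.exp(-lam_p * t)  # one processor has failed by time t
    f_m = 1.0 - math.exp(-lam_m * t)  # one memory module has failed by time t
    # The system fails when all processors or all memories have failed.
    return 1.0 - (1.0 - f_p ** n_p) * (1.0 - f_m ** n_m)

def mttf(lam_p, lam_m, steps=100_000):
    """Approximate MTTF = integral of (1 - F_sys(t)) dt with the trapezoidal rule."""
    # Assumption: the slowest term of 1 - F_sys decays like exp(-(lam_p + lam_m) t),
    # so truncating the integral at this horizon leaves a negligible tail.
    t_max = 40.0 / (lam_p + lam_m)
    h = t_max / steps
    total = 0.5 * ((1.0 - f_sys(0.0, lam_p, lam_m)) + (1.0 - f_sys(t_max, lam_p, lam_m)))
    for i in range(1, steps):
        total += 1.0 - f_sys(i * h, lam_p, lam_m)
    return total * h

# Hypothetical rates: one failure per time unit for every component.
print(mttf(1.0, 1.0))
```

For unit failure rates the reliability expands to $6e^{-2t} - 9e^{-3t} + 5e^{-4t} - e^{-5t}$, whose integral is 1.05, so the numerical value should be close to that.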

Now suppose we want to investigate a different computer design where the two processors have fast private
memory modules and the system has slower, shared memory modules. We assume that the system operates 
as long as there is at least one operational processor with access to either a private or shared memory. We 
cannot model this system with a block diagram, because there is no way to model how the shared memories 
are connected to all processors while private memories are connected to particular processors. So, we turn 
to a fault tree model, shown for two processors and three memory modules in Figure 2. We could also use 
a reliability graph, where time-to-failure distributions are assigned to the edges. The system is operational 
as long as there is a path from source (src) to sink. In this particular model (Figure 3), processor failures
happen along the edges labeled P1 and P2 and memory failures happen along the edges M1, M2, and M3.
The edges I1 and I2 do not represent system components; they represent the structure of the system (the
sharing of M3). We assign the “infinite” distribution, defined by $I(t) = 0$, to them. There is a path from
source to sink if P1 and M1 are up or if P1 and M3 are up, and similarly for paths involving P2. Analysis
of the reliability graph results in the same failure time distribution as the fault tree analysis.

Figure 2: A fault tree model.

Figure 3: A reliability graph model.

Now we extend our models to take into account repair or replacement of parts. We calculate the “avail- 
ability” of the system, the (transient or steady-state) probability that the system is functioning. We examine 
the all-shared-memory system and look at three repair strategies: 

1. There are enough repair resources to repair all components at the same time, if necessary. 

2. There are two repair facilities, one for processors and one for memory modules, each able to handle 
one component at a time. 

3. There is one repair facility, able to handle one component at a time. Processor repair has preemptive 
priority over memory repair. 

For the first strategy, the states of the components (either up or down) are mutually independent, since
the failure and repair of each component do not depend on those of any other component. Because of this
independence, we can use the block diagram used to model reliability (Figure 1) to model availability as well.
Instead of assigning to each component the time-to-failure distribution, we use the transient unavailability. If
the $i$-th component has exponentially distributed failure behavior with rate $\lambda_i$, and repair is also exponentially
distributed with rate $\mu_i$, its unavailability at time $t$ is

$$U_i(t) = \frac{\lambda_i}{\lambda_i + \mu_i} - \frac{\lambda_i}{\lambda_i + \mu_i} e^{-(\lambda_i + \mu_i) t} \qquad (1)$$

and the steady-state unavailability is given by

$$\lim_{t \to \infty} U_i(t) = \frac{\lambda_i}{\lambda_i + \mu_i}.$$

These expressions can be derived by solving the two-state (up/down) CTMC for a component [62].

Figure 4: A CTMC model.
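
As a rough illustration of the first repair strategy, the sketch below plugs the component unavailability of Equation 1 into the block diagram structure of Figure 1. The function names and the numerical rates are assumptions chosen for the example. It also assumes that every component starts in the up state.

```python
import math

def unavailability(t, lam, mu):
    """Transient unavailability of a two-state component that is up at time 0."""
    return lam / (lam + mu) * (1.0 - math.exp(-(lam + mu) * t))

def system_unavailability(t, lam_p, mu_p, lam_m, mu_m, n_p=2, n_m=3):
    """System unavailability when all components fail and are repaired independently."""
    u_p = unavailability(t, lam_p, mu_p)
    u_m = unavailability(t, lam_m, mu_m)
    # Same structure as the reliability block diagram: the system is down
    # when all processors are down or all memories are down.
    return 1.0 - (1.0 - u_p ** n_p) * (1.0 - u_m ** n_m)

# Hypothetical rates: failure rate 1 and repair rate 9 for every component.
print(system_unavailability(0.5, 1.0, 9.0, 1.0, 9.0))
print(system_unavailability(1e3, 1.0, 9.0, 1.0, 9.0))  # close to the steady-state value
```

For large $t$ each component unavailability tends to 0.1 with these rates, so the second value approaches $1 - (1 - 0.1^2)(1 - 0.1^3) = 0.01099$.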

If we analyze the reliability block diagram of Figure 1 with the assignment of distribution functions of
Equation 1 to the components, the resulting function is the system unavailability at time $t$, $U_{sys}(t)$, and the
“mass at infinity” $(1 - \lim_{t \to \infty} U_{sys}(t))$ is the steady-state system availability.

To deal with the second and third repair strategies, we can no longer use the block diagram model. The 
block diagram assumes that all components are statistically independent, but, if components share repair 
facilities, the failure and repair behavior of one component is dependent on the state of all components. 

If the failure and repair distributions are exponential, we can use a CTMC model. Consider the CTMC 
in Figure 4. State mp represents the system when m memory units and p processors are functional. The 
model with all of the solid and dashed-line transitions is for the second repair strategy (one repair facility for 
processors and one for memories). The model for the third strategy (only one repair facility giving priority 
to the processors) is obtained by excluding the dashed lines, since no memory is repaired while there are 
failed processors. 
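
The CTMC for the second and third repair strategies can be sketched in code. Since the figure is not reproduced here, the transition rates below are assumptions: each working component fails at its own rate while the system is running, no failures occur once the system is down, and each repair facility works at rate `mu_p` or `mu_m`. The solver and all function names are illustrative only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def transitions(state, strategy, lam_p, lam_m, mu_p, mu_m, n_p=2, n_m=3):
    """Outgoing (next_state, rate) pairs of state (m, p); the rates are assumptions."""
    m, p = state
    out = []
    if m > 0 and p > 0:  # components fail only while the system is running
        out.append(((m - 1, p), m * lam_m))
        out.append(((m, p - 1), p * lam_p))
    if p < n_p:  # processor repair, one processor at a time
        out.append(((m, p + 1), mu_p))
    if m < n_m and (strategy == 2 or p == n_p):  # strategy 3: processors first
        out.append(((m + 1, p), mu_m))
    return out

def steady_state_unavailability(strategy, lam_p, lam_m, mu_p, mu_m):
    start = (3, 2)
    states, index = [start], {start: 0}
    i = 0
    while i < len(states):  # breadth-first exploration of the reachable states
        for nxt, _ in transitions(states[i], strategy, lam_p, lam_m, mu_p, mu_m):
            if nxt not in index:
                index[nxt] = len(states)
                states.append(nxt)
        i += 1
    n = len(states)
    q = [[0.0] * n for _ in range(n)]
    for s in states:
        for nxt, rate in transitions(s, strategy, lam_p, lam_m, mu_p, mu_m):
            q[index[s]][index[nxt]] += rate
            q[index[s]][index[s]] -= rate
    # pi Q = 0 with sum(pi) = 1: transpose Q and replace one balance equation.
    a = [[q[r][c] for r in range(n)] for c in range(n)]
    a[-1] = [1.0] * n
    pi = solve(a, [0.0] * (n - 1) + [1.0])
    return sum(pi[index[s]] for s in states if s[0] == 0 or s[1] == 0)

u2 = steady_state_unavailability(2, 0.1, 0.1, 1.0, 1.0)
u3 = steady_state_unavailability(3, 0.1, 0.1, 1.0, 1.0)
print(u2, u3)
```

With these hypothetical rates, the single shared facility of the third strategy yields a higher unavailability than the two dedicated facilities of the second strategy.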

We note that we could have used a CTMC for the first repair strategy as well. We would have assigned 
different transition rates to the repair transitions to reflect the fact that more than one component can be 
repaired at a time. As an example, the rate for the transition from 02 to 12 would be $3\mu_m$ rather than
$\mu_m$. The block diagram model, though, is both easier to construct and more efficient to analyze.

Figure 5: A GSPN model.

Before leaving the subject of unavailability, we illustrate the use of one more model type, the GSPN. For 
a discussion of this model type, the reader is referred to [1]. Modeling the availability of this system with a 
GSPN does more than just give us another validity check. It allows us to find the unavailability for a system 
with any number of processors and memories without having to construct a separate model for each number 
of components. The GSPN in Figure 5 is a model of the system in which there is one repair facility to be 
shared for all components. 

There is a token for each processor and each memory. Initially, there are $n_p$ tokens in the place ppup (place:
processors up) and $n_m$ tokens in the place pmup (place: memories up). When a processor fails, its token
moves from place ppup through transition tpfail (transition: processor fails) to place pprep (place: processor 
waiting for repair). Processor repair is represented by a token moving from place pprep through transition 
tprep to place ppup. The inhibitor arcs from pprep to tmfail and pmrep to tpfail reflect the assumption that 
if the system has already failed because all processors or all memories have failed, the remaining working 
components do not fail while they are not running. This aspect of the system was modeled only implicitly in 
the CTMC model, by the absence of failure transitions from the states with either no operating processors
or no operating memory modules. The inhibitor arc from pprep to tmrep is the one that represents our 
assumption that there is only one repair facility; if there are any failed processors, there can be no memory 
repair. 

We can verify that analyzing this GSPN with $n_p = 2$ and $n_m = 3$ gives the same result for system steady-state
unavailability as the CTMC model. We note that the GSPN, although a more efficient specification,
is no more efficient to analyze than the CTMC, since analysis of a GSPN involves translating the GSPN 
into a CTMC. However, dependability modeling with GSPN tends to be clumsy [41]. Stochastic reward nets 
remove this restriction from GSPN models. We elaborate more on this in Section 4. 

2.2 System Performance Models 

In this section, we look at aspects of system performance, including performance of gracefully degraded
systems. In the performance domain, task precedence graphs [31, 55] can be used to model the performance
of concurrent programs with unlimited resources. Product form queueing networks [35, 36], on the
other hand, can represent contention for resources. However, they cannot model concurrency within a job,
synchronization, or server failures, since these violate the product form assumptions.

Figure 6: A product form queueing network for the system with three shared memories.

Figure 7: A product form queueing network for the system with one shared memory and two local memories.

We consider the same two system architectures as in Section 2.1: the first containing two processors 
and three shared memory modules and the second containing two processors, each with a private memory 
module, and one shared memory module. 

To capture the effects of contention for the processor and memory resources, we use queueing network 
models. We assume that the memory modules are servers in the sense that they queue requests and perform 
block transfers. To set up a realistic queueing model, we would have to take into account the proposed oper- 
ating system design, especially the scheduling aspects, and we would need some kind of expected workload 
characterization. For the sake of illustration, we use the closed queueing network models shown in Figures
6 and 7.

The network in Figure 6 is for the design containing two processors and three shared memory modules.
We model the two processors by a multiple-server station. That is, jobs wait in a single queue and enter
whichever server becomes free. When a job wants to access the memory, it requires memory module $M_i$ with
probability $pr_i$. After some visits to the processor, a job finishes: $pr_0$ is the probability that a job is finished
when it leaves one of the processors. As is usual for closed queueing networks, the assumption is that each 
finished job is replaced by a statistically identical new job. 

The network in Figure 7 is for the design containing two private memory modules. For this system, we 
assume that jobs are targeted to particular processors. This is reasonable, since, once a job starts on a 
processor, we want it to continue where it has access to that processor’s private memory. We carry out this 
assumption by making the queueing network a “multiple-chain” queueing network, in this case having two 
“chains”, or classes of jobs. Jobs in the first class go from P1 to either M1 or Ms and back to P1 and jobs
in the second class go from P2 to either M2 or Ms and back to P2.

As expected, the system with private memories provides higher system throughput than the
shared-memory system.

To model the systems when one memory has failed, we remove the server M1 (and its queue) from each
of the models and adjust the probabilities $pr_i$ and $pr_{ij}$ appropriately.

Queueing models are able to capture the effects of resource contention, but measures related to the total 
number of jobs serviced do not capture the performance of the system as seen by a single parallel program: 
series-parallel acyclic graph models [55] can be used for this purpose. 

CTMCs also provide a useful framework to model system performance, but a detailed CTMC model
is often large and complex and its construction is an error-prone process. Hence there is a need for a 
higher-level model-type having an underlying CTMC, which is then automatically generated from it. Some 
attempts in the specific instance of dependability modeling have resulted in useful packages like SAVE [23], 
for availability modeling, which uses a block diagram input, and HARP [21], for reliability modeling, which 
uses a fault-tree input. A suitable interface is necessary for a more general modeling environment. GSPNs 
[1] and SRNs [13] provide an excellent interface for detailed performance modeling of complex systems. 

The advent of fault-tolerant computing has resulted in the design of machines which continue to function 
even in the presence of failures, albeit at a reduced level of performance. Pure reliability or performance 
models of such systems do not capture the whole picture. This has prompted researchers to consider the 
combined evaluation of performance and reliability [44, 63]. The CTMC is extended by associating rewards 
with its states to obtain a “Markov reward process”, or “Markov reward model” (MRM). This process not 
only facilitates modeling of performance and reliability but also the combined evaluation of performance and 
reliability. Since this paper considers the automatic generation of the CTMC from the GSPN description of 
the model, the reward structure must also be defined in terms of the GSPN entities. Consequently the GSPN 
description is modified to obtain “stochastic reward nets” [13] which can be automatically transformed to 
obtain the underlying MRM.

3 CTMCs versus MRMs


CTMCs have been traditionally used to model dependability. MRMs [25] are CTMCs in which reward rates 
may be associated with states of the CTMC (rate-type rewards) or with transitions of the CTMC (impulse-type
rewards) or both. We consider MRMs with rate-type rewards. MRMs have been successfully used for
performability analysis [44, 63] according to the following methodology. Initially, a dependability model (also 
known as structural model) of the system is constructed. Assuming the dependability model is state-space 
type (such as a CTMC), a performance measure is obtained (possibly by solving a performance model) for 
each state of the dependability model. This performance measure becomes the reward rate of that state 
in the dependability model. With the reward-rate assignment, the dependability model becomes an MRM 
which may then be solved for various performability measures. There is an approximation involved in this 
decomposition of performance and dependability models: the system is assumed to have attained (quasi- 
)steady-state in each state of the dependability model, so that the reward rate for each state of the reliability 
model is a steady-state performance measure. Transient or steady-state analysis of the dependability model 
with rewards is then carried out. The justification for this decomposition lies in the fact that the performance 
activities are much faster than the dependability events. 

CTMCs can also be used for performability analysis if a monolithic model is constructed which combines 
both the dependability and performance model of the system. However, the state-space of this model is 
approximately the cross-product of state-spaces of the dependability and performance models. In addition, 
this monolithic model is stiff because of extreme disparity between the transition rates (job arrival rates
could be $10^9$ or more times the fault occurrence rates). One may argue that this approach is more
accurate than the MRM approach since no approximation is involved. However, this gain in accuracy may 
well be negated due to the computational problems posed by largeness and stiffness of the monolithic model. 
We focus more on these two problems, largeness and stiffness, in later sections. The MRM approach has 
another significant advantage. No assumptions are made about how the reward rates are obtained. The 
reward rates may be obtained by simulation, by solving a queuing network, or by solving a semi-Markov 
process (SMP), etc. 

It is easy to see that CTMCs are special cases of MRMs and therefore dependability analysis becomes a 
special case of performability analysis. In this section, we briefly show how various dependability measures 
can be analyzed as performability measures when the MRM has a special reward-rate assignment. Let 
$\{\Theta(t), t \ge 0\}$ be an MRM with state space $\Psi$ and constant reward rate $r_i$ associated with each state $i$ of
the CTMC. If the MRM spends $\tau_i$ units of time in state $i$, then $r_i \tau_i$ is the reward accumulated during this
sojourn. Let $Q$ be the generator matrix and $P(t)$ be the state probability vector of the MRM. Here $P_i(t)$
denotes the transient probability of the MRM being in state $i$ at time $t$. The transient behavior of this MRM
is given by the Kolmogorov differential equation:

$$\frac{dP(t)}{dt} = P(t) Q, \qquad (2)$$

given the initial state probability vector $P(0)$. The steady-state probability vector $\pi$ (assuming that it exists
and is unique) is obtained by setting the l.h.s. in Equation 2 to zero:

$$\pi Q = 0, \qquad \sum_{i \in \Psi} \pi_i = 1.$$


The cumulative state probability vector of the MRM is defined as $L(t) = \int_0^t P(x)\,dx$, where $L_i(t)$ denotes the
expected total time spent by the MRM in state $i$ during the interval $[0, t)$. To compute $L(t)$, we integrate
Equation 2:

$$\frac{dL(t)}{dt} = L(t) Q + P(0).$$


The reward rate at time $t$ for the MRM is given by $\Upsilon(t) = r_{\Theta(t)}$. The accumulated reward over the
interval $[0, t)$ is given by:

$$\Phi(t) = \int_0^t \Upsilon(x)\,dx = \int_0^t r_{\Theta(x)}\,dx.$$

The expected reward rate at time $t$ of the MRM is:

$$E[\Upsilon(t)] = \sum_{i \in \Psi} r_i P_i(t).$$

The expected reward rate in steady-state for the MRM is:

$$E[\Upsilon_{ss}] = \sum_{i \in \Psi} r_i \pi_i.$$

To compute availability measures, the state-space of the MRM is partitioned into two: a set of system-up 
states, with reward rate 1, and a set of system-down states, with reward rate 0. We term this a 0-1 reward
assignment. The transient availability of the system is given by $E[\Upsilon(t)]$ and steady-state availability is given
by $E[\Upsilon_{ss}]$.

The expected accumulated reward over the interval $[0, t)$ is:

$$E[\Phi(t)] = \sum_{i \in \Psi} r_i L_i(t).$$

The expected time-averaged reward rate over the interval $[0, t)$ is given by $\sum_{i \in \Psi} r_i L_i(t)/t$. In an availability
model with 0-1 reward assignment, the total uptime of the system over the interval $[0, t)$ is $E[\Phi(t)]$. Interval
availability is the proportion of time a system is up in a given interval of time and it is given by $E[\Phi(t)]/t$
for the interval $[0, t)$.
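
The measures above can be illustrated on a small MRM. The sketch below is an assumption-laden example rather than a production solver. It integrates Equation 2 and the equation for $L(t)$ together with a classical fourth-order Runge-Kutta scheme, which is only adequate for small, non-stiff models. The two-state availability model, its rates, and all names are hypothetical.

```python
import math

def vec_mat(v, q):
    """Row vector times matrix."""
    n = len(v)
    return [sum(v[i] * q[i][j] for i in range(n)) for j in range(n)]

def axpy(p, k, h):
    """Return p + h * k elementwise."""
    return [pi + h * ki for pi, ki in zip(p, k)]

def mrm_measures(q, p0, rewards, t, steps=2000):
    """Return (E[reward rate at t], E[accumulated reward], time-averaged reward)."""
    n = len(p0)
    h = t / steps
    p, cum = list(p0), [0.0] * n
    for _ in range(steps):
        # RK4 stages for dP/dt = P Q; the stage points also integrate dL/dt = P.
        s1 = p
        k1 = vec_mat(s1, q)
        s2 = axpy(p, k1, 0.5 * h)
        k2 = vec_mat(s2, q)
        s3 = axpy(p, k2, 0.5 * h)
        k3 = vec_mat(s3, q)
        s4 = axpy(p, k3, h)
        k4 = vec_mat(s4, q)
        for i in range(n):
            cum[i] += h / 6.0 * (s1[i] + 2.0 * s2[i] + 2.0 * s3[i] + s4[i])
        p = [p[i] + h / 6.0 * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) for i in range(n)]
    rate = sum(r * x for r, x in zip(rewards, p))
    accumulated = sum(r * x for r, x in zip(rewards, cum))
    return rate, accumulated, accumulated / t

# Hypothetical two-state model: state 0 is up (reward 1), state 1 is down (reward 0).
lam, mu, t = 1.0, 9.0, 1.0
q = [[-lam, lam], [mu, -mu]]
availability, uptime, interval_availability = mrm_measures(q, [1.0, 0.0], [1.0, 0.0], t)
closed_form = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
print(availability, closed_form, uptime, interval_availability)
```

With the 0-1 reward assignment, the first returned value is the transient availability, the second is the expected uptime over $[0, t)$, and the third is the interval availability.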

For MRMs with absorbing states, the state-space $\Psi$ is partitioned into two: $\Psi_A$ (set of absorbing states)
and $\Psi_T$ (set of transient states). Let $Q_T$ be the submatrix of $Q$ corresponding to the transitions between
transient states. The mean time spent by the MRM in state $i \in \Psi_T$ before absorption is given by $\tau_i =
\int_0^\infty P_i(x)\,dx$, which is obtained by integrating Equation 2 from 0 to $\infty$:

$$\tau Q_T + P_T(0) = 0.$$

The mean time to absorption is given by:

$$MTTA = \sum_{i \in \Psi_T} \tau_i.$$

To compute reliability measures, all the system-down states are made absorbing states (transitions leaving 
from them are deleted). The same 0-1 reward assignment is used. The reliability is given by $E[\Upsilon(t)]$.
The lifetime (similar to total uptime) [20] of the system over the interval $[0, t)$ is $E[\Phi(t)]$. The expected
accumulated reward until absorption is:

$$E[\Phi(\infty)] = \sum_{i \in \Psi_T} r_i \tau_i,$$

and the mean time to failure (MTTF) of the system is $E[\Phi(\infty)]$.
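
A small worked example may help here. The sketch below solves $\tau Q_T = -P_T(0)$ for a hypothetical two-processor system that is up while at least one processor works, with failure rate `lam` per processor and a single repair facility with rate `mu`. The model and the function names are assumptions made for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def mean_times_before_absorption(q_t, p_t0):
    """Solve tau Q_T = -P_T(0), written as Q_T^T tau^T = -P_T(0)^T."""
    n = len(p_t0)
    a = [[q_t[r][c] for r in range(n)] for c in range(n)]
    return solve(a, [-x for x in p_t0])

def mttf_two_processors(lam, mu):
    # Transient states: index 0 = both processors up, index 1 = one processor up.
    # The absorbing state (no processor up) is left out of Q_T.
    q_t = [[-2.0 * lam, 2.0 * lam],
           [mu, -(lam + mu)]]
    tau = mean_times_before_absorption(q_t, [1.0, 0.0])
    rewards = [1.0, 1.0]  # 0-1 reward assignment: both transient states are up states
    return sum(r * x for r, x in zip(rewards, tau))

print(mttf_two_processors(1.0, 0.0))   # no repair
print(mttf_two_processors(1.0, 10.0))  # with repair
```

For this model the linear system can also be solved by hand, giving an MTTF of $(3\lambda + \mu)/(2\lambda^2)$, which the two printed values should match.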

The distribution of the reward rate at time $t$, $\Upsilon(t)$, is computed as:

$$P[\Upsilon(t) \le x] = \sum_{r_i \le x,\ i \in \Psi} P_i(t).$$

The distribution of accumulated reward until absorption or a finite period can also be computed. If the 
time to accumulate a given reward r is T(r), then the distribution of T(r) is known once the distribution of 
accumulated reward is known [32]: 

$$P[T(r) \le t] = 1 - P[\Phi(t) < r]. \qquad (3)$$

For instance, the distribution of time to complete a job that requires r units of processing time on a system 
which is modeled by an MRM can be computed in this fashion. 

From the above discussion, it is clear that dependability analysis can be carried out using MRMs with 
special reward rate assignment to various system states. This analysis can also be carried out using CTMCs 
(without rewards) in an equally efficient manner. However, performability analysis, which can be easily 
carried out using MRMs, becomes cumbersome if rewards are not used. 

4 SPNs versus SRNs 

CTMCs modeling real systems tend to be large, sometimes with hundreds of thousands of states. A higher-level
specification mechanism is thus needed for the concise description of the model and the automatic
conversion into a CTMC. Stochastic Petri nets (SPNs) provide such a mechanism. Molloy [48] used SPNs 
for performance analysis and showed that they are isomorphic to CTMCs. Since then, several extensions 
have been made to SPNs. Some of these extensions have enhanced the flexibility of use and allowed for even 
more concise description of performance and reliability models. Some other extensions have enhanced the 
modeling power by allowing for non-exponential distributions (see Section 5.3). 

In this section, we compare SPNs with and without rewards. Specifically, we compare SRNs as defined
by Ciardo et al. [13] and GSPNs as defined by Ajmone-Marsan et al. [2]. SRNs are an extension of GSPNs,
since they include all the features of GSPNs and add more features. There are several structural extensions
such as guards (earlier known as enabling functions), priorities with timed transitions, marking-dependent
arc cardinalities, and halting condition. Besides the structural extensions, a reward rate function associates
a reward rate with each reachable marking. GSPNs and SRNs have been shown to be isomorphic to CTMCs
and MRMs respectively. However, we show in this section that SRNs allow a much more concise description
of system behavior than GSPNs. This is particularly true for dependability models. Furthermore, certain
reward-based measures as described in Section 3 can be computed using SRNs but cannot be computed
using GSPNs.

Figure 8: A simple network

To compare GSPNs and SRNs, we present an example. Consider a simple network between src and 
sink nodes consisting of three links (Figure 8). The network is operational as long as link A and at least 
one of the links B or C is operational. Assuming that each link has its independent repairperson, the 
availability of the network can be modeled by the GSPN shown in Figure 9. A token in places pA, pB,
and pC respectively indicates that links A, B, and C are operational. A token in place pF implies that
the network has failed. A token in place pR implies that, due to repair of one or more links, the component
is ready to be operational again. The firing of transition tR removes the token from pF, signifying that
the network is operational. The steady-state (transient) probability of a token being in place pF gives the 
steady-state (transient) unavailability of the network. 

The availability of this network can also be modeled by an SRN as shown in Figure 10. The reward rate 
function is as shown in the table. The expected value of reward rate r in steady-state (or at time t) gives 
the steady-state (transient) availability of the network. Let us now compare the GSPN and SRN models. A 
GSPN model requires a mesh of immediate transitions, places, and inhibitor arcs to capture the operational 
dependence of the network on the links. Part of this mesh captures the dependence such as the subsystem 
of links B and C fails only when both B and C have failed. The other part of the mesh captures the impact 
of repairs of links which reflect complementary conditions, such as removal of a token from place pBC as 
soon as either B or C is repaired. As the systems grow in complexity, this mesh becomes very complex 
and unwieldy. On the other hand, an SRN captures the operational dependence of the network on links by
the reward rate function. This results in a simpler and more manageable net.

Figure 9: GSPN availability model of the network

Name       Boolean Function
bool_A     (#tokens(pA) == 1)
bool_BC    (#tokens(pB) == 1) ∨ (#tokens(pC) == 1)
bool_NW    bool_A ∧ bool_BC

Reward Rate Function
if (bool_NW == 1) then r = 1 else r = 0

Figure 10: SRN availability model of the network
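
The reward rate function of Figure 10 can be mimicked in a few lines. This is a sketch under stated assumptions, not the SRN tool's own syntax. Markings are represented as dictionaries from place names to token counts. Because each link has its own repairperson, the links are assumed to fail and be repaired independently, so the steady-state probability of a marking is a product of per-link terms. The rate dictionaries are hypothetical.

```python
from itertools import product

PLACES = ("pA", "pB", "pC")

def reward_rate(marking):
    """0-1 reward rate of a marking, following the table of Figure 10."""
    bool_a = marking["pA"] == 1
    bool_bc = marking["pB"] == 1 or marking["pC"] == 1
    bool_nw = bool_a and bool_bc
    return 1 if bool_nw else 0

def steady_state_availability(lam, mu):
    """Expected steady-state reward rate over all markings of the three link places."""
    # Assumption: independent links, so each is up with probability mu / (lam + mu).
    up = {place: mu[place] / (lam[place] + mu[place]) for place in PLACES}
    total = 0.0
    for tokens in product((0, 1), repeat=len(PLACES)):
        marking = dict(zip(PLACES, tokens))
        prob = 1.0
        for place, n in marking.items():
            prob *= up[place] if n == 1 else 1.0 - up[place]
        total += reward_rate(marking) * prob
    return total

# Hypothetical rates: failure rate 1 and repair rate 9 for every link.
lam = {"pA": 1.0, "pB": 1.0, "pC": 1.0}
mu = {"pA": 9.0, "pB": 9.0, "pC": 9.0}
print(steady_state_availability(lam, mu))
```

With these rates each link is up with probability 0.9, and the result should agree with $0.9 \cdot (1 - 0.1^2) = 0.891$. The structural dependence sits entirely in `reward_rate`, which is the point of the SRN formulation.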

5 Computational Problems


In modeling practice, it is often the case that no single model is adequate to solve a problem. Different parts 
or levels of detail in a system may require different modeling techniques. In cases where a single model type 
can be used, it may be too large (a problem both for specification and analysis) or intractable (“stiff” or 
ill-conditioned). Three main difficulties in analytic models include largeness, stiffness, and the need to model 
non-exponential distributions. We explore these topics in the following subsections. 

5.1 Largeness 

The problem of model largeness can be handled in two ways: it can be avoided or it can be tolerated. 

5.1.1 Largeness Tolerance 

For the sake of simplicity we assume that the underlying model is a CTMC or an MRM. If we are prepared 
to store and solve the matrix of a large model, we should start with a concise description of the system 
model and provide for the automated generation and the solution of the underlying state space. A number 
of approaches have evolved for such specifications. Haverkort and Trivedi [24] summarize these approaches. 
They present seven different classes of specification techniques: Stochastic Petri nets and their variants, 
Communicating processes, Queueing networks, Specialized languages, Fault-trees, Production rule systems, 
and Hybrid techniques. We refer the reader to the cited paper for further details. 

5.1.2 Largeness Avoidance

If the size of the underlying CTMC (or MRM) is so large as to preclude generation and storage, we must 
resort to approximations that avoid the large underlying model. State truncation, lumping, decomposition 
and fluid models constitute the types of approximations that have been utilized. We discuss these four 
approaches below. 

Truncation For many practical systems, the exact number of structural states in a corresponding model 
might be extremely large, or even infinite. State-space based approaches, then, cannot be applied directly 
to the model. In many cases, though, the system spends most of the time in a small subset of the entire 
state space; most states have an extremely small probability. 

This is particularly true of highly reliable systems: if a system has K components, and if each component 
fails with a very small rate (as is normally the case), states with more than a handful of failed components 
are rarely reached. Indeed, it is common practice in reliability modeling to stop the state-space exploration 
after $k \ll K$ failures, with the implicit assumption that states with $k + 1$ or more failed components have
negligible probability. This is just one example of state truncation. 

As an example, consider a $K$-processor system, where nodes fail and are repaired with rates $\lambda$ and $\mu$,
respectively. We want to compute the expected cumulative computational capacity during the time interval
$[0, t)$, $C(t)$, that is, the expected number of non-failed processors as a function of time, integrated between
0 and $t$. If the state is characterized by the number of working processors, the model corresponds to a
birth-death process with state space $\{K, K-1, \ldots, 0\}$ (Figure 11). If the processors have different failure
and repair behaviors, the identity of the failed processors must be recorded in the state and the size of the
state space grows dramatically, from $K + 1$ to $2^K$.

Figure 11: A typical model that can be truncated.

Figure 12: Strict truncation.

Formally, given a reachability graph $(\mathcal{S}, \mathcal{A})$, a state truncation results in a truncated 
reachability graph $(\mathcal{S}', \mathcal{A}')$. 

If $(\mathcal{S}', \mathcal{A}')$ is a subgraph of $(\mathcal{S}, \mathcal{A})$, the exact state-space exploration 
algorithm, or the model, is simply modified to ignore certain arcs which lead to states in 
$\mathcal{S} \setminus \mathcal{S}'$. In our example, we can prevent a (k + 1)-th failure in a state which already 
has k failed components. We call this case “strict truncation” (Figure 12). 

Alternatively, $(\mathcal{S}', \mathcal{A}')$ might be composed of a subgraph of $(\mathcal{S}, \mathcal{A})$, 
augmented with one or more states and arcs. In our example, we might add a new state u (for unknown), and an arc 
from each state with k failed components to u, corresponding to further failures of the non-failed components. 
Strictly speaking, this is more an “aggregation”, so we call this approach an “aggregation truncation” (Figure 13). 

The two approaches often allow us to obtain upper and lower bounds on the measure of interest. In our 
example, we can solve the two CTMCs of Figures 12 and 13, obtaining two transient probability vectors: 

$$\pi^s(t) = [\pi^s_K(t), \ldots, \pi^s_{K-k}(t)]$$

and 

$$\pi^a(t) = [\pi^a_K(t), \ldots, \pi^a_{K-k}(t), \pi^a_u(t)]$$

respectively. If we associate the reward rates 

$$\rho^s = [\rho^s_K = K, \ldots, \rho^s_{K-k} = K - k]$$

and 

$$\rho^a = [\rho^a_K = K, \ldots, \rho^a_{K-k} = K - k, \rho^a_u = 0]$$

with the states of the two CTMCs, we obtain the inequalities 

$$C^s(t) = \sum_{i \in \{K, \ldots, K-k\}} \rho^s_i \int_0^t \pi^s_i(x)\,dx \;\geq\; C(t) \;\geq\; \sum_{i \in \{K, \ldots, K-k\}} \rho^a_i \int_0^t \pi^a_i(x)\,dx = C^a(t).$$

Figure 13: Aggregation truncation. 

If we are interested in the expected instantaneous computational capacity in steady state, c, that is, the 
expected number of non-failed processors in the long run, the CTMC in Figure 12 still offers an upper bound, 
but the one in Figure 13 is of no use, since state u has probability one in steady state, which would simply 
result in the trivial lower bound 0 for c. In any case, our ability to obtain useful bounds is normally tied to 
our a priori knowledge of aspects of the CTMC structure and values of the reward rates. In our example, 
we can prove that $C^s(t)$ is an upper bound on C(t) because we know that 

• Removing the set of states {K - k - 1, ..., 0} does not decrease the probability of any of the states in 
{K, ..., K - k}. 

• The maximum reward rate of the states in {K - k - 1, ..., 0} is not larger than the minimum reward 
rate of the states in {K, ..., K - k}. 

and we can prove that $C^a(t)$ is a lower bound on C(t) because we know that 

• Aggregating the set of states {K - k - 1, ..., 0} into a single absorbing state u does not increase the 
probability of any of the states in {K, ..., K - k}. 

• The minimum reward rate of the states in {K - k - 1, ..., 0} is not smaller than 0, the reward rate of 
state u. 

For steady state analysis, more sophisticated arguments based on [17] can be used [49]. We conclude by 
observing that simulation is, in a probabilistic sense, a form of automatic truncation, since the most likely 
states are visited frequently while unlikely states may not be visited at all. 
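
The bounding argument can be checked numerically on a small instance. The following Python sketch is an 
illustration only, not part of the original treatment. It assumes that the K processors fail and are repaired 
independently (so the state with f failed processors has failure rate (K - f)λ and repair rate fμ), that all 
processors are up at time 0, and that the parameter values are arbitrary. The function names are hypothetical, 
and the plain uniformization loop is adequate only for modest values of q·t. 

```python
import math

def cumulative_reward(rates, rewards, t, eps=1e-10):
    """Expected reward accumulated over [0, t) by a CTMC that starts in state 0.

    rates   -- dict {(i, j): rate} of off-diagonal transition rates
    rewards -- list of reward rates, one per state
    Uses uniformization: the integral of pi(x) over [0, t) equals
    (1/q) * sum_n pi(0) P^n * (1 - PoissonCDF(n; q*t)).
    """
    n = len(rewards)
    exit_rate = [0.0] * n
    for (i, _), r in rates.items():
        exit_rate[i] += r
    q = 1.02 * max(exit_rate)              # uniformization rate
    if q == 0.0:
        return rewards[0] * t              # no transitions at all
    if q * t > 500.0:
        raise ValueError("q*t too large for this naive Poisson recursion")
    v = [1.0] + [0.0] * (n - 1)            # pi(0) P^n, starting with n = 0
    poisson = math.exp(-q * t)             # Poisson(n; q*t) for n = 0
    cdf = poisson
    total, step = 0.0, 0
    while 1.0 - cdf > eps:
        total += (1.0 - cdf) * sum(p * r for p, r in zip(v, rewards))
        # One step of the uniformized DTMC, P = I + Q/q.
        nxt = [v[i] * (1.0 - exit_rate[i] / q) for i in range(n)]
        for (i, j), r in rates.items():
            nxt[j] += v[i] * r / q
        v = nxt
        step += 1
        poisson *= q * t / step
        cdf += poisson
    return total / q

def processor_model(K, lam, mu, max_failed, absorbing_u=False):
    """State f = number of failed processors, f = 0..max_failed.

    With absorbing_u=True an extra absorbing state u (index max_failed + 1,
    reward rate 0) collects the failures that leave state max_failed.
    """
    rates = {}
    for f in range(max_failed + 1):
        if f > 0:
            rates[(f, f - 1)] = f * mu                 # repair
        if f < max_failed or (absorbing_u and f < K):
            rates[(f, f + 1)] = (K - f) * lam          # failure
    rewards = [float(K - f) for f in range(max_failed + 1)]
    if absorbing_u:
        rewards.append(0.0)
    return rates, rewards

K, k, lam, mu, t = 6, 2, 0.1, 1.0, 10.0    # illustrative values only
c_exact = cumulative_reward(*processor_model(K, lam, mu, K), t)
c_strict = cumulative_reward(*processor_model(K, lam, mu, k), t)
c_aggr = cumulative_reward(*processor_model(K, lam, mu, k, absorbing_u=True), t)
print(c_aggr, c_exact, c_strict)           # lower bound, exact value, upper bound
```

Under these assumptions the strictly truncated model needs only k + 1 states and the aggregated one k + 2, 
while the exact model needs K + 1. 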

Lumping Most complex systems (models) consist of a large set of subsystems (submodels), many of them of 
the same type. The state of the system is then obtained by composing the state of each subsystem. When 
performing state-space exploration, though, there are simplifications which might lead to a smaller state 
space while still allowing an exact solution. For example, in our system with K processors, we could model 
each of them as an independent subsystem which can be in one of two states, up or down. The entire system 
can then be viewed as composed of K such subsystems, thus having $2^K$ states. This approach is wasteful, 
though, since it is not necessary to distinguish between processors if they all have the same failure and 
repair behavior. Rather, we can represent the state of the system as the number of subsystems in each state 
(up or down; since the total number of processors is known, we can simply remember the number of up 
processors). 

This application of lumping [30, 53] is indeed so natural that we used it in conjunction with truncation, 
without even justifying its adoption. In real systems, though, the reachability graph of a subsystem might 
be quite complex. The general algorithm to obtain the lumped state space for a system consisting of K 
independent subsystems can be easily expressed making use of SPNs [13] (see also [29] for an example of use 
of this algorithm): 

1. Generate the reachability graph for a single subsystem. Markings and arcs are labeled with the number 
of tokens in each place and the name of the corresponding transition, respectively. 

2. Transform the reachability graph into a SPN: for each marking i, add a place $p_i$, initially empty; for 
each arc from marking i to marking j labeled by transition t, add a transition $t_i$, with marking-dependent 
rate equal to $\#(p_i)$ times the rate of t in marking i for a single subsystem, an input arc from $p_i$ to 
$t_i$, and an output arc from $t_i$ to $p_j$ ($\#(p_i)$ is the number of tokens in place $p_i$). 

3. Set the initial marking of the SPN: for each subsystem, if its initial state is i, add a token in $p_i$. Note 
that the subsystems can start in different initial states without affecting the correctness of lumping. 

4. Generate the CTMC underlying this SPN. 
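
As an illustration of steps 1 to 4, the sketch below builds the lumped CTMC directly from the reachability 
graph of one subsystem, without going through an SPN tool. It is a minimal Python sketch under stated 
assumptions: the subsystems are independent and identical, a lumped state is the vector of token counts 
#(p_i), and the four-state cyclic subsystem with its rates is a made-up example rather than the model of any 
figure. The function name lumped_ctmc is hypothetical. 

```python
from collections import deque

def lumped_ctmc(sub_rates, initial_counts):
    """Lumped CTMC for identical, independent subsystems.

    sub_rates      -- dict {(i, j): rate} for ONE subsystem (local states 0..N-1)
    initial_counts -- tuple; entry i is the number of subsystems starting in state i
    Returns (states, arcs), where a state is a tuple of counts and
    arcs maps (state, next_state) to the lumped transition rate.
    """
    start = tuple(initial_counts)
    states = {start}
    arcs = {}
    queue = deque([start])
    while queue:
        m = queue.popleft()
        for (i, j), rate in sub_rates.items():
            if m[i] == 0:
                continue                    # no subsystem in local state i
            nxt = list(m)
            nxt[i] -= 1                     # one token leaves place p_i ...
            nxt[j] += 1                     # ... and enters place p_j
            nxt = tuple(nxt)
            # Marking-dependent rate: #(p_i) times the single-subsystem rate.
            arcs[(m, nxt)] = arcs.get((m, nxt), 0.0) + m[i] * rate
            if nxt not in states:
                states.add(nxt)
                queue.append(nxt)
    return states, arcs

# Two-state (up/down) subsystem, K = 5: the lumped model has K + 1 = 6 states.
up_down = {(0, 1): 0.5, (1, 0): 2.0}        # hypothetical failure and repair rates
states2, arcs2 = lumped_ctmc(up_down, (5, 0))

# Hypothetical four-state cyclic subsystem, K = 3.
cycle = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 0): 1.0}
states4, arcs4 = lumped_ctmc(cycle, (3, 0, 0, 0))
print(len(states2), len(states4))
```

The exploration visits only count vectors that are reachable from the initial marking, so the number of 
generated states can be smaller than the worst-case count when some local states are unreachable. 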


Figure 14 shows the application of the algorithm to a system composed of K dual-redundant subsystems, 
where repair is initiated only when both units have failed. Each subsystem is described by a SPN whose 
reachability graph has four markings. If no lumping is applied, the total number of states is $4^K$. The 
application of our algorithm, instead, results in a SPN with (K + 3)(K + 2)(K + 1)/6 states. 

In general, if there are K subsystems with N states each, the sizes of the state space without and with 
lumping are 

$$N^K = \underbrace{N \times \cdots \times N}_{K \text{ terms}}$$

and 

$$\binom{N+K-1}{K} = \underbrace{\frac{N+K-1}{K} \times \cdots \times \frac{N+1}{2} \times \frac{N}{1}}_{K \text{ terms}},$$

respectively. Each of the K terms in the second case is smaller than N, with the exception of the last one, 
which is N, so this approach is guaranteed to reduce the size of the state space whenever K > 1 and N > 1. 
The reduction is particularly sizable when N is small and K is large: for example, when N = 2 we have $2^K$ 
vs. K + 1. 
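
The two counts are easy to tabulate. The short Python sketch below only evaluates the two expressions for a 
few values of K; the function names are hypothetical and the choice N = 4 mirrors the four-marking subsystem 
mentioned above. 

```python
from math import comb

def unlumped_size(N, K):
    """Number of states when every subsystem is tracked individually."""
    return N ** K

def lumped_size(N, K):
    """Number of ways to distribute K indistinguishable subsystems over N local states."""
    return comb(N + K - 1, K)

for K in (1, 5, 10, 20):
    print(K, unlumped_size(4, K), lumped_size(4, K))
```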

In practice, the submodels have some interaction, so independence does not hold. If the interaction is 
limited to a “rate dependence” [14] where the transition rates in a subsystem depend on the number of 
subsystems in certain states, but not on their identity, the algorithm can still be applied: only a different 
specification of the firing rates for the resulting SPN is needed. In our example, the repairperson could be 
a shared resource, so the rate of transition repair in each subsystem could be $\lambda / f^{1.1}$, where f is 
the total number of subsystems being repaired, and the exponent 1.1 models the inherent inefficiency due to 
resource sharing. The rate of transition $repair_{1010}$ in the resulting SPN should then be specified as 
$\lambda / \#(p_{1010})^{0.1}$, that is, $\#(p_{1010})$ times the rate for a single subsystem, where 
$\#(p_{1010})$ indicates the number of tokens in $p_{1010}$ or, in other words, f. 

Other types of dependence are structural: often, tokens might have to move from a submodel to another 
portion of the global model. With some care, lumping might still be possible [57]. 

Composition In this approach, the overall model is composed of a set of submodels. Construction and 
generation of a large model is avoided and the solution is obtained by interactions among the submodels. 
Interactions imply exchange of information between the submodels. Reward based performability analysis 
[44, 63] is an example of composition of reliability and performance models. The performance submodel is 
solved and its results are passed as reward rates to the reliability submodel. In general, quantities such as 
probability distributions, mean, variance, or numerical values of reliability and availability are exchanged 
among submodels. 
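
A minimal sketch of this kind of non-iterative composition is given below in Python. Everything specific in 
it is an assumption made for illustration: the performance submodel is taken to be an M/M/i/i loss system 
whose throughput with i working processors is computed from the Erlang-B formula, the availability submodel 
is taken to be K independent processors in steady state, and the function names and parameter values are 
hypothetical. The point is only the information flow: the throughputs are computed first and then used as 
reward rates in the availability submodel. 

```python
from math import comb

def erlang_b(servers, offered_load):
    """Blocking probability of an M/M/c/c loss system (Erlang-B recursion)."""
    b = 1.0
    for c in range(1, servers + 1):
        b = offered_load * b / (c + offered_load * b)
    return b

def throughput(servers, arrival, service):
    """Performance submodel: accepted-job rate with the given number of servers."""
    return arrival * (1.0 - erlang_b(servers, arrival / service))

def up_probabilities(K, lam, mu):
    """Availability submodel: steady-state distribution of the number of
    working processors, assuming independent failures and repairs."""
    p = mu / (lam + mu)
    return [comb(K, i) * p ** i * (1.0 - p) ** (K - i) for i in range(K + 1)]

def expected_throughput(K, lam, mu, arrival, service):
    # Reward rates come from the performance submodel ...
    rewards = [throughput(i, arrival, service) for i in range(K + 1)]
    # ... and are weighted by the state probabilities of the availability submodel.
    probs = up_probabilities(K, lam, mu)
    return sum(p * r for p, r in zip(probs, rewards))

print(expected_throughput(K=4, lam=1e-3, mu=0.5, arrival=3.0, service=1.0))
```

The large availability-and-performance CTMC is never built: each submodel is small and is solved on its own. 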

Other examples of composition include flow-equivalent server approximation introduced by Chandy et 
al. [10], behavioral decomposition used in the software tool HARP [21], composition of GSPNs and queueing 
networks proposed by Balbo et al. [3], and hybrid hierarchical composition employed in the software tool 
SHARPE [56]. These approaches can be classified as hierarchical composition techniques. Hierarchical 
composition approaches differ not only in the way the model is constructed but also in the way the model 
is solved. The set of submodels can be solved iteratively using a fixed-point iteration scheme (a cyclic 
dependence exists among the submodels) [12, 14, 47, 61] or in a non-iterative fashion (a strict hierarchy exists 
among the submodels) [43, 56]. For a unified view of these seemingly different approaches to hierarchical 
composition, refer to [42]. 

Fluid Models As the number of tokens in a place or the number of jobs in a queue becomes large, the size 
of the underlying CTMC grows. It may be possible to approximate the number of tokens in the place, or 
the number of jobs in the queue, as a non-negative real number. It is then possible to write the differential 
equations for the dynamic behavior of the model and, in some cases, provide a solution. Mitra has developed 
models along these lines [46]. More recently, Kulkarni and Trivedi have proposed fluid stochastic Petri nets 
(FSPNs) [33]. 

5.2 Stiffness 


CTMC stiffness is a computational problem which adversely affects the stability, accuracy, and efficiency of 
a numerical solution method unless that method has been specially designed to handle it. CTMC stiffness 
is caused by extreme disparity between transition rates. In a reliability model, repair rates could be $10^6$ 
times the failure rates. In a monolithic performability model, the job arrival rates could be $10^9$ times the 
component failure rates. In this section, we discuss how stiffness can be overcome. To begin with, we describe 
how the extreme disparity between transition rates translates into a computational problem for numerical 
solution methods. 

Let us consider the linear system of differential equations in Equation 2. This system is considered stiff 
if the solution has components whose rates of change (decay or gain) differ greatly. The rate of change of 
each solution component is governed by the magnitude of an eigenvalue of the generator matrix Q . This 
system is considered stiff if, for i = 2, ..., m, $\mathrm{Re}(\lambda_i) < 0$ and 

$$\max_i |\mathrm{Re}(\lambda_i)| \gg \min_i |\mathrm{Re}(\lambda_i)|,$$

where $\lambda_i$ are the eigenvalues of Q. The rate of change of a solution component is defined relative to the 
solution interval, hence Miranker [45] gave the following definition: “a system of differential equations is said 
to be stiff in the interval [0, t) if there exists a solution component of the system which has variation in that 
interval that is large compared to 1/t”. However, the CTMC attains numerical steady-state at some finite 
time $t_{ss}$: within the specified accuracy (or error tolerance) the state probability vector does not change with 
increase in time. Hence we may redefine stiffness: “the system of differential equations in Equation 2 is 
said to be stiff in the interval [0, t) if there exists a solution component of the system which has variation in 
that interval that is large compared to $1/\min\{t, t_{ss}\}$”. The large difference in transition rates of the CTMC 
approximately translates into large difference in magnitude of the eigenvalues of the generator matrix. 

Stiffness could cause numerical instability and make the solution methods inefficient if the methods are 
not designed to handle stiffness. As with largeness, there are two basic approaches to overcome stiffness: 
stiffness avoidance and stiffness tolerance. 

5.2.1 Stiffness Avoidance 

According to this approach, stiffness is eliminated from a model by applying some approximation scheme. 
This results in a set of non-stiff models which are then solved to obtain the overall solution. Bobbio and 
Trivedi [8] have designed one such technique based on aggregation. Most of these approaches avoid largeness 
as well, since some kind of model decomposition or aggregation is involved. 

5.2.2 Stiffness Tolerance 

Special solution methods that are designed to handle stiffness are used in this approach. The two most 
commonly used methods for transient analysis of CTMCs are uniformization and numerical ODE solution 
methods. It has been shown [54, 40] that uniformization is inefficient for stiff CTMCs. A modified 
implementation of uniformization which incorporates steady-state detection of the underlying discrete-time 
Markov chain (DTMC) [50] was shown to be more efficient than the standard implementation when the solution 
interval was larger than $t_{ss}$. However, uniformization remains much less efficient than L-stable ODE 
methods [38]. L-stable ODE methods [34] are recommended for stiff CTMCs. Among these, second-order 
TR-BDF2 [54] is efficient for low accuracy requirements and the third-order implicit Runge-Kutta method [40] 
is efficient for high accuracy requirements. Recently, more efficient methods based on stiffness detection [37] 
have been proposed. 
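
The effect of stiffness on an explicit method can be seen on the smallest possible example. The Python sketch 
below is an illustration under stated assumptions, not one of the methods cited above: it takes a two-state 
availability model with hypothetical failure rate λ = 10^-6 and repair rate μ = 1, and integrates the 
probability of the up state over [0, 100) with a step of 10 using forward Euler (explicit) and backward Euler 
(implicit, and the simplest L-stable method). TR-BDF2 and implicit Runge-Kutta are not implemented here. 

```python
import math

def exact_up(lam, mu, t):
    """Closed-form probability of the up state, starting in the up state."""
    s = lam + mu
    return mu / s + (lam / s) * math.exp(-s * t)

def forward_euler_up(lam, mu, t, steps):
    """Explicit Euler on d(pi_up)/dt = mu - (lam + mu) * pi_up."""
    h, x = t / steps, 1.0
    for _ in range(steps):
        x += h * (mu - (lam + mu) * x)
    return x

def backward_euler_up(lam, mu, t, steps):
    """Implicit Euler: solve x_new = x + h * (mu - (lam + mu) * x_new) for x_new."""
    h, x = t / steps, 1.0
    for _ in range(steps):
        x = (x + h * mu) / (1.0 + h * (lam + mu))
    return x

lam, mu, t, steps = 1e-6, 1.0, 100.0, 10     # step size h = 10
ref = exact_up(lam, mu, t)
err_explicit = abs(forward_euler_up(lam, mu, t, steps) - ref)
err_implicit = abs(backward_euler_up(lam, mu, t, steps) - ref)
print(err_explicit, err_implicit)
```

Forward Euler is stable here only for h < 2/(λ + μ), so with h = 10 its error is multiplied by roughly 9 in 
magnitude at every step, although the true solution is already at steady state to within the tolerance. 
Backward Euler damps the error for any step size, which is why large steps remain usable. 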

5.3 Non-exponential distributions 

5.3.1 Phase Approximations 

The basic methodology of phase approximations is to replace a non-exponential distribution in a model by 
a set of states and transitions between those states such that the holding time in each state is exponentially 
distributed. This follows from Cox [18], who showed that any non-exponential probability distribution with a 
rational Laplace-Stieltjes transform (LST) can be represented by a series of exponential stages with 
complex-valued transition rates. Each stage is entered with some probability and exited (the process stops) with 
complementary probability. However, conditions to determine whether the resulting function is a proper 
cdf or not are not known. To overcome this problem, Neuts [52] restricted the Coxian representation by 
defining phase type distributions as absorbing-time distributions of a CTMC with at least one absorbing 
state. Non-exponential distributions can be approximated by phase type distributions (also known as phase 
approximations when used in this context). Distributions without rational LSTs can be approximated by 
distributions having rational LSTs, although arbitrarily close approximations may require a CTMC with a 
large state space. 

A complete approach to phase approximations is discussed in [39]. This approach consists of a few basic 
steps: 

• Selecting a phase approximation class for a given distribution. One of the most commonly used phase 
approximation classes is a mixture of Erlang distributions [9]. It has been used in [26, 39, 60] and 
good fits to some commonly occurring distributions such as Weibull, deterministic, lognormal, and 
uniform have been obtained. Schmickler [58] has used mixtures of Erlang distributions to fit empirical 
functions. Bobbio et al. [6, 4, 5] have used a different kind of acyclic phase approximation and obtained 
good fits to several distributions. 

• Obtaining the parameters of phase approximations. Once a suitable phase approximation has been 
chosen for a given distribution (which may be in empirical form), the next step is to fit the parameters 
of this phase approximation. The choices include moment matching, function (cdf or pdf) fitting, 
maximum likelihood estimation (in case of empirical distributions), or a combination of these [9, 39]. 
Johnson and Taffe [27, 28] have considered matching the first three moments of mixtures of two Erlang 
distributions. For more references on this topic, refer to [7]. 

• Generation of the overall CTMC. After the parameters of phase approximations for all the non- 
exponential distributions have been fitted (or estimated), the overall CTMC is generated. This may 
require the cross-product of phase approximations [39]. 
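
The three steps can be illustrated on the simplest case. The Python sketch below approximates a 
deterministic delay of duration d by a single Erlang distribution with k stages, matching only the mean 
(each stage has rate k/d). This is a deliberate simplification of the mixtures of Erlang distributions 
mentioned above; the function names and the values of k are hypothetical. 

```python
import math

def erlang_cdf(k, rate, t):
    """P(T <= t) where T is the sum of k exponential stages with the given rate."""
    if t <= 0.0:
        return 0.0
    term, tail = math.exp(-rate * t), 0.0
    for n in range(k):
        tail += term                   # adds the Poisson probability of n events
        term *= rate * t / (n + 1)
    return 1.0 - tail

def erlang_phases(k, mean):
    """Rates of the stage chain 0 -> 1 -> ... -> k, where state k is absorbing."""
    rate = k / mean
    return {(i, i + 1): rate for i in range(k)}

def erlang_moments(k, mean):
    """Mean and variance of the k-stage Erlang with the given mean."""
    rate = k / mean
    return k / rate, k / rate ** 2

d = 1.0                                # deterministic delay being approximated
for k in (1, 10, 100):
    mean, var = erlang_moments(k, d)
    print(k, var, erlang_cdf(k, k / d, 0.5 * d), erlang_cdf(k, k / d, 1.5 * d))
```

As k grows the variance d^2/k shrinks and the cdf approaches a step at d, but every non-exponential 
activity approximated this way contributes k stages to the overall CTMC, which is the state-space cost 
noted above. 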

A few software packages implementing this approach have been developed. Phase approximations were 
used in the SURF package [16], although SURF was intended only for a restricted class of reliability models. 
Cumani [19] has designed the software package ESP for evaluation of SPNs with phase-type distributed firing 
times. Phase approximations for a class of non-Markovian models have been implemented in GSHARPE 
[39]. GSHARPE is a front end for a general purpose performance and reliability modeling toolkit called 
SHARPE [56]. It accepts a non-Markovian model and converts it into a CTMC in SHARPE syntax after 
applying phase approximations. 

5.3.2 Non-homogeneous CTMCs 

If transition rates in a CTMC are allowed to be time-dependent, where time is measured from the beginning of 
system operation, the model becomes a non-homogeneous CTMC. Such models are used in software reliability 
under the name of NHPP (Non-Homogeneous Poisson Process) [51] and in hardware reliability models of 
non-repairable systems [22]. Tools such as CARE III and HARP allow component failure distributions to 
be Weibull using this approach. 
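
For a single non-repairable component, the non-homogeneous CTMC has two states, up and failed, and the 
probability of the up state satisfies dπ_up/dt = -h(t) π_up(t), where h(t) is the time-dependent failure rate. 
The Python sketch below assumes a Weibull hazard h(t) = (α/η)(t/η)^(α-1) with hypothetical shape α ≥ 1 and 
scale η, integrates the equation with the classical fourth-order Runge-Kutta method, and compares the result 
with the closed form exp(-(t/η)^α). It is only an illustration and does not reflect how CARE III or HARP are 
implemented. 

```python
import math

def weibull_hazard(t, shape, scale):
    """Time-dependent failure rate of a Weibull distribution (shape >= 1 assumed)."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

def reliability_rk4(t_end, shape, scale, steps=1000):
    """Integrate d(pi_up)/dt = -h(t) * pi_up from pi_up(0) = 1 with classical RK4."""
    def deriv(time, prob):
        return -weibull_hazard(time, shape, scale) * prob

    h, t, p = t_end / steps, 0.0, 1.0
    for _ in range(steps):
        k1 = deriv(t, p)
        k2 = deriv(t + h / 2, p + h / 2 * k1)
        k3 = deriv(t + h / 2, p + h / 2 * k2)
        k4 = deriv(t + h, p + h * k3)
        p += h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return p

def reliability_closed_form(t_end, shape, scale):
    return math.exp(-((t_end / scale) ** shape))

shape, scale, t_end = 2.0, 1000.0, 500.0     # illustrative values only
print(reliability_rk4(t_end, shape, scale), reliability_closed_form(t_end, shape, scale))
```

With shape = 1 the hazard is constant and the model reduces to an ordinary homogeneous CTMC with an 
exponential time to failure. 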

5.3.3 Markov regenerative processes (MRGPs) 

The use of non-homogeneous CTMCs allows transition rates to be globally time-dependent, while the use 
of SMPs allows the time dependence to be local (measured from the entry into the state). Both of these are 
often inadequate in practice. While, in principle, the phase approximations allow more general time dependence, 
their practical usefulness is limited by the increased size of the underlying stochastic process, which further 
exacerbates the largeness problem. MRGPs seem to provide a useful time-dependence that can capture 
many interesting practical scenarios. The basic idea is that not every state change is required to be a 
regeneration point. Thus, in a multi-component system with each component having exponential time-to- 
failure distribution and a generally distributed repair with a single repairperson (FCFS), the underlying 
stochastic process is a MRGP (but not a SMP or a CTMC). Recent work on this topic can be found in 
[11, 15]. 

6 Conclusion 


We discussed several types of modeling techniques used in dependability and performability analysis, with 
a particular emphasis on approaches based on the (entire or partial) generation of the state-space. 

The common underlying formalisms we consider, continuous-time Markov chains (CTMCs) and Markov 
reward models (MRMs), are capable of modeling a large class of systems, but they result in large models, 
difficult to describe and analyze. The description problem is solved by using higher-level formalisms, such 
as reliability graphs, fault trees, queueing networks, generalized stochastic Petri nets, and stochastic reward 
nets. With the appropriate software modeling tools, these can then be automatically translated into CTMCs 
or MRMs. 

The solution problem, though, remains, since the size of the underlying stochastic process grows combi- 
natorially. In addition, when modeling activities with very different time-scales, such as failure and repair 
of components, and performance-related behavior, such as arrival and departure of jobs, stiffness arises. 
Advanced numerical techniques, and exact or approximate approaches such as truncation, aggregation, com- 
position, and fluid models, can then be effectively used to obtain numerical solutions. 